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Abstract — Commute Time Distance (CTD) is a random 
walk based metric on graphs. CTD lias found widespread 
applications in many domains including personalized search, 
collaborative filtering and making search engines robust against 
manipulation. Our interest is inspired by the use of CTD 
as a metric for anomaly detection. It has been shown that 
CTD can be used to simultaneously identify both global and 
local anomalies. Here we propose an accurate and efficient 
approximation for computing the CTD in an incremental 
fashion in order to facilitate real-time applications. An online 
anomaly detection algorithm is designed where the CTD of each 
new arriving data point to any point in the current graph can 
be estimated in constant time ensuring a real-time response. 
Moreover, the proposed approach can also be applied in many 
other applications that utilize commute time distance. 

Keywords -commute time distance; incremental commute 
time; random walk; anomaly detection; 

I. Introduction 

Commute Time Distance (CTD) is a random walk based 
metric on graphs. The CTD{i,j) between two nodes i and 
j is the expected number of steps a random walk starting at i 
will take to reach j for the first time and then return back to 
i. The fact that CTD is averaged over all paths (and not just 
the shortest path) makes it more robust to data perturbations. 

CTD has found widespread applications in personalized 
search [l], collaborative filtering 121, [3] and making search 
engines robust against manipulation |4|. Our interest is in- 
spired by the use of CTD as a metric for anomaly detection. 
It has been shown that CTD can be used to simultaneously 
identify global, local and even collective anomalies in data 
0. 

More advanced measures generally require more expen- 
sive computation. Estimating CTD involves the eigen de- 
composition of the graph Laplacian matrix and consequently 
has 0{n^) time complexity which is impractical for large 
graphs. Saerens, Pirotte and Fouss |6| used subspace approx- 
imation, and Khoa and Chawla |5| used graph sampling to 
reduce the complexity. Sarkar and Moore |7| introduced a 
notion of truncated commute time and a pruning algorithm 
to find nearest neighbors in commute time. They empirically 
demonstrated achieving a near-linear running time as a 



function of graph size. Spielman and Srivastava [8J have 
proposed a near-linear time algorithm for approximating 
pairwise CTD in O(logn) time based on random projec- 
tions. 

However, there are many applications in practice which 
require the computation of CTD in an online fashion. When 
a new data point arrives, the application needs to respond 
quickly without recomputing everything from scratch. The 
algorithms noted above all work in a batch fashion and have 
a high computation cost for online applications. 

We are interested in the following scenario: a dataset D is 
given from an underlying domain of interest. For example, 
data from a network traffic log or environment or climate 
change monitoring. A new data point arrives and we want 
to determine if it is an anomaly with respect to D in CTD. 
Intuitively a new data point is an anomaly if it is far away 
from its neighbors in CTD. 

Example 1: Consider the two graphs shown in Figure [T] 
While the shortest path distance between node 1 and 2 is 
the same in both graphs, CTD{1, 2) increases after node 5 
is added. This property of CTD can be used to great effect 
to detect both global and local outliers. However, the same 
property makes it challenging to calculate the CTD in an 
incremental manner 





(a) A graph of 4 nodes (b) Adding node 5 

Figure 1: CTD (1,2) increases after addition of node 5 
even though shortest path distance remains unchanged. This 
property of CTD has many applications including anomaly 
detection. 

Here we propose and compare two methods for computing 
CTD in an incremental fashion. The first method is based 
on an incremental update of the eigen decomposition of a 
Laplacian matrix. The second method uses the recursive 



definition of CTD based on hitting time. To the best of 
our knowledge both these methods are novel and their 
comparison provides revealing insights about CTD which 
are independent of the application domain. 
The contributions of this paper are as follows: 

« We make use of the characteristics of random walk to 
estimate CTD incrementally in constant time. The same 
approach could be used for estimating the hitting time 
incrementally. 

« We propose a provably fast method to incrementally 
update the eigenvalues and eigenvectors of the graph 
Laplacian matrix. This method can be integrated with 
any technique requiring a graph spectral computation, 
such as spectral clustering. 

• We design an online algorithm for anomaly detection 
using incremental CTD. The technique is verified by 
experiments in synthetic and real datasets. The experi- 
ments show the effectiveness of the proposed methods 
in terms of accuracy and performance. 

The remainder of the paper is organized as follows. 
Section |ll] reviews notation and concepts related to random 
walk and CTD, and a simple example to tie up all the 
definitions and ideas. In Sections Hill and HVl we present two 
methods to incrementally approximate the CTD. In Section 
rvl we introduce an online anomaly detection algorithm 
which uses incremental CTD. In Section IVll we evaluate our 
approach using experiments on synthetic and real datasets. 
Sections IVIII covers related work. We conclude in Section 
IVIIII with a summary and a direction for future research. 

II. Commute Time Distance 

We provide a self-contained introduction to random walks 
with an emphasis on CTD. Assume we are given a connected 
undirected and weighted graph G = {V, E, W). 

Definition 1: Let i be a node in G and N(i) be its 



neighbors. The degree di of a node i is J^jeNU) '^ij- The 
volume Vg of the graph is defined as X^igy '^i- 

Definition 2: The transition matrix M — {pij)i,jev of a 
random walk on G is given by 



Pij 



0, otherwise 



Definition 3: Let Pq be an initial distribution on G. 
Define Pt = (M^)*Po for all t > 0. A distribution Pq 
is stationary if Pi = Pq ID. 

Fact 1: The distribution Pq defined by 7r(w) = -^ for all 
I! G y is a stationary distribution f9l. 

Definition 4: A random walk is time- reversible if for 
every pair of nodes i,j e V, ■n{i)pij — T^{j)Pji |9|. 

Definition 5: The Hitting Time hij is the expected num- 
ber of steps that a random walk starting at i will take before 
reaching j for the first time. 



Definition 6: The Hitting Time can be defined in terms 
of the recursion 



E 



l£N(i) 



Ptihij 



if i^j 
otherwise 



Definition 7: The Commute Time Distance c^ between 
two nodes i and j is given by Cij = hij + hji. 

Fact 2: CTD is a metric: (i) cu = 0, (ii) Cij — Cji and 
(iii) Cij < Cik + Ckj lHO). 

Remarkably, CTD can be expressed in terms of the 
Laplacian of G. 

Definition 8: Let D be the diagonal degree matrix and A 
be the adjacency matrix of G. The Laplacian of G is the 
matrix L = D — A. 

Fact 3: 1) Let e^ be the V dimensional column vector 
with a 1 at location i and zero elsewhere. 

2) Let {Xi,Vi) be the eigenpair of L for all i G V, i.e., 
Lvi = AjWj. 

3) It is well known that Ai = 0, ui = (1,1,..., 1)^ and 
all A, > 0. 

4) Assume = Ai < A2 . . . < A|y|. 

5) Then the pseudo-inverse of L denoted by L~^ is 

in I 

i=2 * 



Fact 4: 



Vailt + l 



2l+) = VG{e^-ejfL+{e,-ej) (1) 



where /^ is the (i, j) element of i+ JS). 

Example: Again, consider the graph G shown in Figure 
[Tal where all the edge weights equal to 1 . The sum of the 
degree of nodes, Vg ~ 8. We will calculate the commute 
time C12 in two different ways: 

1) Using random walk: note that the expected number 
of steps for a random walk starting at node 1 and 
returning back to it is ^ = j = 8 [9]. But the walk 
from node 1 can only go to node 2 and then return 
from node 2 to 1. Thus C12 = 8. 

2) Using algebraic approach: the Laplacian matrix is 



/ 



V 



-1 

3 

-1 
-1 




-1 

2 
-1 



-1 



and the pseudo-inverse is 

/ 0.69 
-0.06 
-0.31 

V -0.31 

Now C12 — Vciei — 



L+ = 



-0.06 

0.19 

-0.06 

-0.06 



-0.31 

-0.06 

0.35 

0.02 



-0.31 
-0.06 

0.02 

0.35 / 



e2)^L+(ei - 62) and 



) 



0.69 -0.06 

-0.06 0.19 

-0.31 -0.06 

-0.31 -0.06 



-0.31 -0.31 

-0.06 -0.06 

0.35 0.02 

0.02 0.35 



Thus ci2 = Vg X 1 = 8. 

Suppose we add a new node (labeled 5) to node 4 with 
a unit weight as in Figure [Tbl Then c"|"' — V^'^'^/di — 
10/1 = 10. 

The example in Figure [lb] shows that by adding an edge, 
i.e. making the "cluster" which contains node 2 denser, Ci2 
increases. This shows that CTD between two nodes captures 
not only the distance between them (as measured by the edge 
weights) but also their neighborhood densities. For the proof 
of this claim, see |5|. This property of CTD has been used to 
simultaneously discover global and local anomalies in data 
- an important problem in the anomaly detection literature. 

In the above example, we exploited the specific topology 
(degree one node) of the graph to calculate CTD efficiently. 
This can only work for very specific instances. The general, 
more widely used but slower approach for computing CTD 
is to use the Laplacian formula. A key contribution of this 
paper is that for incremental computation of CTD we can 
use insights from this example to accurately and efficiently 
compute the CTD in much more general situations. 

III. Incremental Eigen Decomposition of Graph 
Laplacian 

In this section, we propose a method to incrementally 
update the eigensystem (eigenvalues and eigenvectors) of the 
Laplacian when a new node along with edges to its neighbors 
is added to the underlying graph. The unique feature of our 
approach as opposed to that of Ning et. al. ifTTl are (i) our 
emphasis is on handling the addition of a new node and the 
corresponding edges to its nearest neighbors as opposed to 
just weight updates on existing edges and (ii) simultaneous 
updating of all weight edges as opposed to one edge at a 
time. 

A. Iterative incremental update of the Laplacian eigensys- 
tem 

We propose an algorithm based on the following propo- 
sition to incrementally update the eigensystem (A, v) of the 
Laplacian L when a new node i is added to the graph. 
Suppose there are k edges e — (i,j) e En with weight 
We added to the graph from i. Denote AL and (AA, Aw) 
be changes of L and (A, v) resulting from the addition of i. 
Note that the size of matrix L and its eigenvector v change 
as mentioned in the Appendix. 

Proposition 1: The solution of the eigensystem 

{L + AL)(i; + Aw) = (A + AA)(u + Aw) 

can be derived from the solution of the following set of 
simultaneous equations. 

EeeB„ We[v(i) - v{i)][v{i) ~ v(i) + Aw(i) - Au(j)] 



AA = 



where 



and 



if = L + Ai - (A + AA)/, 
h = (AA/ - AL)w. 



(4) 
(5) 



For the proof, see Appendix. 

Note that in general it is not practical to solve the system 
Aw = K^^h at the arrival of each new data point i. In 
practice, as noted by Ning. et. al. fTT), we can set Aw(fc) — 
for all components which are not i, its first or second order 
neighbors. 

Denote Ni — {j\d{i,i) < 2}, where d{i,j) is the shortest 
path between i and j. Let /sTjv be the matrix derived from K 
after removing columns which do not correspond to nodes 
in Ni, vn and Awat be the vectors derived from w and Aw 
after removing elements which do not correspond to nodes 
in Ni. Since Kj\i is not a square matrix, we obtain: 



Aw - {KlKr,)-^Klh, 



(6) 



AA 



1 + w^Aw 



Av = K-'^h 



(2) 
(3) 



Segi;„ Web(i) - vU)][vii) - v{j) + Aw(i) - Aw(j)] 

1 + w], Awat 

(7) 
Since AA in Equation [T] depends on the value of Aw in 
Equation |6] and vice versa, we can update the values of A A 
and Aw as follows. We initialize the values Aw = to 
update the value of A A and then using that to update Aw. 
The procedure is repeated until convergence. Algorithm [T| 
gives the details. 

Algorithm 1 Incremental update eigenvalues and eigenvec- 
tors of Laplacian matrix L 

Input: Laplacian matrix L, its eigenvalues S and 
eigenvectors V, weights We of all the new edges 
Output: New eigenvalues Sn and eigenvectors Vn 

1 : for each eigenvalue and eigenvector do 

2: Set Aw = 

3: Update A A using Equation [T] 

4: Update Aw using Equation |6] 

5: Repeat steps 3 and 4 until there is no significant 

change in AA or until the loop reaches a maximum 

iterations 
6: w„ = w + Aw, A„ = A + AA 
7: end for 



IV. Incremental Estimation of Commute Time 
Distance 

In this section, we derive a new method for computing 
the CTD in an incremental fashion. This method uses the 
definition of CTD based on the hitting time. The basic 
intuition is to expand the hitting time recursion until the 
random walk has moved a few steps away from the new 



node and then use the old values. In Section|Vl]we will show 
that this method results in remarkable agreement between 
the batch and online mode. 

We deal with two cases shown in Figure |2] 




(a) Rank 1 (b) Rank k 

Figure 2: Rank 1 and rank k perturbation 

1) Rank one perturbation corresponds to the situation 
when a new node connects with one other node in 
the existing graph. 

2) Rank k perturbation deals with the situation when the 
new node has k neighbors in the existing graph. 

The term rank is used as it corresponds to the rank of the 
perturbation matrix Ai. 

A. Rank one perturbation 

Proposition 2: Let i be a new node connected by one 
edge to an existing node I in the graph G. Let wu be the 
weight of the new edge. Let j be an arbitrary node in the 
graph G. Then 

old , ^G 



'^ij ~ ^Ij 



Wil 



(8) 



where 'old' represents the CTD in graph G before adding i. 
Proof: (Sketch) Since the random walk needs to pass I 
before reaching j, the commute distance from i to j is: 



It is known that; 



Czj — Cii -r Cij . 



Cil 



Wil 



(9) 



(10) 



where Vq is volume of graph G |31. We also know cij ~ 
hji + hij and hji = h°\'^. The only unknown factor is hij. 
By definition: 

Plqhqj=l+ ^ Plqhqj + puhij . 

q£N[l),q^t 



hij = 1 



7 . Pi-q'-q] 

qeN{l) 



Since h, 



1] 



old 



Piq = {l-pii)pTq, and hij = 1 + hij. 



hij « 1 + E (1 - PH)pt^Kf + P''(l + hij) 
= ! + (!- pu) Y. <Xf + P'»(l + ^y) 

qeN{l)^q^l 

= ! + (!- pu){h°^^ ~ l)+pu{l + hi,). 

After simplification, hi, ^ hf/ -^ ^£i^ 
Then ci, w /i°," + h°^'^ ^ ^^ 



i-Pii- 



Since there is only one edge connecting from i to G, i is 
likely an isolated point and thus pu <^ 1. Then 



Clj 



l^old 

n-ji 



hll" 



old 
Clj ■ 



(11) 



As a result from equations |9] [TOl and (TT] 



r Clj 

Wil 



„old , ^G 
Wil 



B. Rank k perturbation 

The rank k perturbation analysis is more involved but the 
final formulation is an extension of the rank one perturba- 
tion. 

Proposition 3: Denote / G G be one of k neighbors of i 
and j be a node in G. The approximate commute distance 
between nodes i and j is: 

Cij ~ 2^ P^lClj + -J- 

l&N{i) ' 



(12) 



i-Pii 



For the proof, see Appendix. 

V. Online Anomaly Detection Algorithm 

We return to our original motivation for computing incre- 
mental CTD. We are given a dataset D which is representa- 
tive of the underlying domain of interest. We want to check 
if a new data point is an anomaly with respect to D. We 
will use the CTD as a distance metric. 

This section describes an online anomaly detection system 
using the incremental update of the eigensystem of the 
Laplacian in Section |lll] and the incremental estimation of 
commute time in Section |TV] 

Generally CTD is robust against small changes or pertur- 
bation in data. Therefore, only the anomaly score of the new 
data point needs to be estimated and be compared with the 
anomaly threshold in the training data. This claim will be 
verified by experiment in Section IVII 

A. CTD-based anomaly detection 

This section reviews the batch method based on CTD to 
detect anomalies f5|. The method is described in Algorithm 
|2] First, a mutual /ci -nearest neighbor graph is constructed 
from the dataset. Then the graph Laplacian matrix L, its 
eigenvectors V and eigenvalues S are computed. Finally, a 
CTD distance-based anomaly detection with a pruning rule 
proposed by Bay and Schwabacher lfT2l is used to find the 
top A^ anomalies. The anomaly score used is the average 
CTD of an observation to its fc2 nearest neighbors. 

Pruning Rule |12|: A data point is not an anomaly if its 
score (e.g. the average distance to its k nearest neighbors) is 
less than an anomaly threshold. The threshold can be fixed 
or be adjusted as the score of the weakest anomaly found 
so far Using the pruning rule, many non-anomalies can be 
pruned without carrying out a full nearest neighbors search. 



Algorithm 2 CTD-Based Anomaly Detection 
Input: Data matrix X, the numbers of nearest neighbors 
ki (for building the /c-nearest neighbor graph) and fc2 (for 
estimating the anomaly score), the number of anomalies to 
return A^ 

Output: Top A^ anomalies 
1: Construct the mutual /c-nearest neighbor graph from the 

dataset (using fci) 
2: Compute the graph Laplacian matrix L, its eigenvectors 

V and eigenvalues S 
3: Find top N anomalies using the CTD based technique 
with pruning rule (using k2)- Each CTD query uses 
Equation [T] 
4: Return top N anomalies 



Algorithm 4 Online Anomaly Detection using the incre- 
mental Estimation of Commute Time (iECT) 
Input: Graph G, Laplacian matrix L, its eigenvalues S and 
eigenvectors V, the anomaly threshold r of the training set, 
and a test data point p 
Output: Determine if p is an anomaly or not 

1: Add p to G using the mutual nearest neighbor graph, 

we have a new graph G„ 
2: Determine if p is an anomaly or not by estimating its 

anomaly score using incremental CTDs mentioned in 

Section HV] Use pruning rule with threshold r to reduce 

the computation 
3: Return whether p is an anomaly or not 



B. Online Algorithms 

Algorithm |3] (denote as iLED) is a method to detect 
anomalies online using the eigenvalues and eigenvectors of 
the new Laplacian matrix which is updated incrementally. 
Algorithm |4] (denote as iECT) on the other hand is a method 
to detect anomalies online using incremental estimation of 
commute time based on hitting time. 



Algorithm 3 Online Anomaly Detection using incremental 

Laplacian Eigen Decomposition (iLED) 

Input: Graph G, Laplacian matrix L, its eigenvalues S and 

eigenvectors V, the anomaly threshold t of the training set, 

and a test data point p 

Output: Determine if p is an anomaly or not 

1: Add p to G using the mutual nearest neighbor graph, 
we have a new graph G„ 

2: Incrementally compute the new eigenvalues and eigen- 
vectors of the new Laplacian _L„ using Algorithm [T] 

3: Use Gram-Schmidt process |13| to orthogonalize the 
new eigenvectors 

4: Determine if p is an anomaly or not by estimating 
its anomaly score using CTDs derived from the new 
eigenpairs. Use pruning rule with threshold r to reduce 
the computation 

5: Return whether p is an anomaly or not 



When a new data point p arrives, it is connected to graph 
G created in the training phase. The CTDs are incrementally 
updated to estimate the anomaly score of p using the 
approach in sections |lll] and |lVl For iLED algorithm, after 
updating the eigenvalues and eigenvectors, we use Gram- 
Schmidt process Iil3il to normalize and orthogonalize the 
eigenvectors. 



C. Analysis 

First, we analyse the incremental eigen decomposition of 
the Laplacian in Section|lII] Here n is the size of the original 
graph (Laplacian) and N is the neighborhood size used in 
Kn (i.e., cardinaHty of Ni). Note N <^n. 

It takes constant time to update AA and 0{N'^n) to 
compute X = KJ^Kn, OiN^) for X-^, 0{Nn) for 
y = Kjfh and 0{N'^) for Aw = X'^y. Since TV < n, we 
obtain 0{n) time for the incremental update of eigenvalues 
and eigenvectors of the Laplacian. 

On the other hand, incremental estimation of commute 
time update in Section |IV] requires 0{m) for each query 
of c°^'^ where m is the number of eigenvectors used. So if 
there are k edges added to the graph, it takes 0{km) for 
each query of CTD. 

Since we only need to compute the anomaly score of a 
test data point using the pruning rule with the threshold of 
anomaly score in the training set, it takes only 0(fc2) nearest 
neighbor search to determine if the test point is an anomaly 
or not where fc2 is the number of nearest neighbors for 
estimating the anomaly score. For each CTD query, it takes 
0{m) for Algorithm |3] (iLED), and 0{km) for Algorithm 
El (iECT). Therefore, iLED takes 0{n + k2m) = 0{n), and 
iECT takes 0{k2km) ~ 0(1) to determine if a new arriving 
point is an anomaly or not. Note that since L and K are 
sparse, we can get better than 0{n) for iLED. 

VI. Experiments and Results 

We report on the experiments carried out to determine and 
compare the effectiveness of the iECT and iLED methods. 
To recall, iECT uses the recursive definition of hitting time 
to calculate CTD while iLED uses the Laplacian definition. 

Approach: We split a data set into two parts: training and 
test. We use Algorithm |2] to compute the top N anomalies in 
the training set and use the average distance of a data point 
to its ^2 nearest neighbor (in CTD) as its anomaly score. 
The weakest anomaly in the top N set is one which has 



the smallest average distance to its nearest neighbors and is 
used as the threshold value r. Then the anomaly score of 
each instance p in the test set is calculated based on its fc2 
neighbors in the training set. If this score is greater than r 
then the test instance is reported as an anomaly. During the 
time searching for the nearest neighbors of p, if its average 
distance to the nearest neighbors found so far is smaller than 
T, we can stop the search as p is not anomaly (pruning rule). 
Data and Parameters: The experiments were carried out 
on synthetic as well as real datasets. We chose the number 
of nearest neighbors fci ~ 10 to build the mutual nearest 
neighbor graph, /c2 = 20 to estimate the anomaly score, the 
number of Laplacian eigenvectors m = 50. In Algorithm [l] 
the threshold to estimate the change of AA was 10^^ and 
the maximum iterations was 5. The choice of parameters 
was determined from the experiments. In all experiments, 
the batch method was used as the benchmark. The anomaly 
threshold r was set based on the training data. It was the 
score of the weakest anomaly in the top iV = 50 anomalies 
found by Algorithm |2] in the training set. 

A. Synthetic datasets 

We created six synthetic datasets, each of which contained 
several clusters generated from Normal distributions and a 
number of random points generated from uniform distribu- 
tion which might be anomalies. The number of clusters, the 
sizes, and the locations of the clusters were also chosen 
randomly. Each dataset was divided into a training set and 
a test set. There were 100 data points in every test set and 
half of them were random anomalies mentioned above. 

Experiments on Robustness: We first tested the robustness 
of CTD between nodes in an existing set when a new data 
instance is introduced. As CTD{i,j) between nodes i and 
j is a measure of expected path distance, the hypothesis is 
that the addition of a new point will have minimal influence 
on CTD{i, j) and thus the anomaly scores of data points in 
the existing set are relatively unchanged. 

Table H] shows the average, standard deviation, minimum, 
and maximum of anomaly scores of points in graph G before 
and after a new data point was added to G. Graph G was 
created from the 1,000 point dataset in Figure [3] The result 
with test point was averaged over 100 test points in the 
test set. The result shows that the anomaly scores of data 
instances in G do not change much. This shows CTD is a 
robust measure, a small change or perturbation in the data 
will not result in large changes in CTD. Therefore, only the 
anomaly score of the new point needs to be estimated. 

Experiments on Effectiveness: We applied iECT and 
iLED to all the datasets. The effectiveness of the iECT 
algorithm over iLED is shown in Figure |4] There were 36 
anomalies detected by batch method using CTD which are 
shown in Figure |3] iECT captured all of them and had 8 
false positives whose scores were close to the threshold. 
iLED approach, on the other hand, had better precision with 



Table I: Robustness of CTD. The anomaly scores of data 
instances in existing graph G are relatively unchanged when 
a new point is added to G. 





Average 


Std 


Min 


Max 


Without test point 


10,560.61 


65,023.18 


6.13 


1,104,648.09 


Withi test point 


10,517.94 


64,840.39 


6.13 


1,101,169.36 
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Figure 3: 1,000 points dataset with training and test sets 



no false positives but worse recall with only 15 anomalies 
found. The reason was anomalies have more effect on the 
eigenvectors and eigenvalues of the graph Laplacian and thus 
iLED was unable to capture many of them. 




M 100 



Figure 4: Accuracy of iLED and iECT in 1,000 points 
dataset. iECT detects anomalies better than iLED. 

To get a better understanding. Figure |5] shows the eigen- 
values and eigenvectors of the new graph computed using 
batch method and iLED in 1,000 points dataset where 
the test point was an anomaly and a non-anomaly. The 
eigenvalue graph shows the top 50 smallest eigenvalues and 
the eigenvector graph shows the dot product of the top 50 
smallest eigenvectors of the new graph Laplacian for both 
batch and iLED methods. When a new data point was a non- 



anomaly, the approximate eigenvalues and eigenvectors were 
significantly more accurate than those when a new point was 
an anomaly. 



-^— iLED 
X Anomalies 





(a) Normal point - Eigenvalues (b) Normal point - Eigenvectors 




(c) Anomaly - Eigenvalues (d) Anomaly - Eigenvectors 

Figure 5: Accuracy of iLED method. iLED has better 
approximation when a test point is not an anomaly. 

Table |II] shows the results in accuracy and performance 
of iLED and iECT in the synthetic datasets. Average score 
was the average anomaly score over 100 test points. The 
precision and recall were for the anomalous class. The time 
was the average time to process each of 100 test points. iECT 
generally had better approximation than iLED, did not miss 
any anomaly and had acceptable false alarms. iLED, on the 
other hand, did not have any false alarms but did miss many 
anomalies. Both of them were more efficient than the batch 
method. Note that the scores shown here were the anomaly 
scores with pruning rule and the scores for anomalies are 
always much higher than scores for normal points. Therefore 
the scores were dominated by the scores of anomalies and 
that is the reason why iLED had much lower scores. In fact, 
iLED is more accurate than iECT in estimating the scores 
of normal points. 

There is an interesting dynamic at play between the 
anomaly, pruning rule, iECT, iLED, and the number of 
anomalies in the data. iECT was slightly slower than iLED in 
the experiment. The reason is we have many anomalies in the 
test set. We know that the pruning rule only works for non- 
anomalies. Moreover, iLED is faster per CTD query com- 
pared to iECT. Therefore, for anomalies, iLED is generally 
faster. Furthermore, because iLED tends to underestimate the 
scores of anomalies (and that was the reason it missed some 
anomalies), anomalies are treated as non-anomalies and the 
pruning kicks-in making it faster. In practise, since most of 
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Figure 6: Performance of iLED and iECT in 50,000 points 
dataset. Except a few false alarms, iECT is generally faster 
than ilED when a test point is not an anomaly. 



the test points are not anomalies, iECT will be more efficient 
than iLED. It is shown in Figure |6] where except a few 
false alarms, iECT was generally faster than iLED in 50,000 
points dataset when test instances were not anomalies. The 
same tendency also happened in other datasets used in the 
experiments. 

B. Real Datasets 

1) DBLP dataset: In this section, we evaluated the iECT 
method on the DBLP co-authorship network. Nodes are 
authors and edge weights are the number of collaborated 
papers between the authors. Since the graph is not fully con- 
nected, we extracted its biggest component. It has 344,800 
nodes and 1,158,824 edges. 

We randomly chose a test set of 50 nodes and removed 
them from the graph. We ensured that the graph remained 
connected. After training, each node was added back into 
the graph along with their associated edges. 

Since the size of the graph is very large, normal training 
using the batch mode in Algorithm |2] is not feasible. Instead 
we implemented the approximate method proposed by Spiel- 
man and Srivastava (SS) |8| and used the underlying linear 
time CMG solver proposed by Koutis |14|. The SS methods 
combines random projections with a linear time solver for 
diagonally dominant matrices to approximate the CTD. The 
SS method creates a matrix Z from which CTD between two 
nodes can be computed in fc = O(logn) time with provable 
accuracy. In practice we can use k to be much smaller than 
0(log7i) and still attain highly accurate results. 

We trained the graph using the SS approach, stored the 
matrix Z and used Z to query the c°-'' in iECT algorithm. 
The batch method here is the CTD approximation using the 
matrix Znew created from the new graph after adding each 
test data point. The parameter for random projection was 
k = 200. 



Table II: Effectiveness of incremental methods. iECT generally does not miss any anomaly and has acceptable false alarms. 
iLED, on the other hand, does not have any false alarms but misses many anomalies. 



Dataset 


iLED 


iECT 


Batch 1 


Size 


Avg Score 


Precision (%) 


Recall (%) 


Time (.s) 


Avg Score 


Precision (%) 


Recall (%) 


Time (s) 


Avg Score 


Time (s) 


1,000 


3.84 X 10" 


100 


41.7 


0.13 


1.69 X 10-' 


81.8 


100 


0.20 


1.61 X 10^ 


0.17 


10,000 


7.09 X 10^ 


100 


53.2 


1.28 


4.87 X lO^ 


95.9 


100 


1.42 


4.99 X lO*' 


2.04 


20,000 


5.36 X ID'S 


100 


83.3 


2.64 


1.75 X 10'^ 


80.0 


100 


2.96 


1.70 X lO'' 


4.53 


30,000 


4.81 X 10'' 


100 


39.6 


3.68 


1.39 X 10* 


96.0 


100 


4.33 


1.40 X 10* 


7.13 


40,000 


3.27 X ID'S 


100 


15.6 


4.44 


5.17 X 10^ 


71.1 


100 


5.19 


4.88 X lO'' 


9.05 


50,000 


8.88 X lO'' 


100 


32.5 


5.70 


6.15 X 10^ 


87.0 


100 


6.61 


5.96 X 10^ 


11.60 



The result shows that it took 0.0066 seconds on average 
over 50 test data points to detect whether each test point 
was an anomaly or not. The batch method, which is the 
fastest approximation of CTD to date, required 944 seconds 
on average to process each test data point. This dramatically 
highlights the constant time complexity of iECT algorithm 
and suggests that iECT is highly suitable for the compu- 
tation of CTD in an incremental fashion. Since there was 
no anomaly in the random test set, we cannot report the 
detection accuracy here. The average anomaly score over 
all the test points of iECT was 1.1 times higher than the 
batch method. This shows the relatively high accuracy of 
iECT approximation even in a very large graph. 

2) KDD Cup 1999 datasets: We used the 10% dataset 
from the KDD cup 1999 competition provided by UCI 
Machine Learning Repository ifTSl . It was used to build 
detection tools of network attacks or intrusions. Since the 
dataset is huge and there are more anomalies than normal 
instances, we sampled 2,200 data points from it where there 
were 2,000 normal points and 200 anomalies (network intru- 
sions). Categorical features were ignored and 38 numerical 
features were used. The dataset was divided into a training 
set and a test set with 100 data points. 

iLED and iECT were applied on this dataset and min- 
max scaling was used as data normalization. iLED had a 
precision of 100% and a recall of 66.7% while iECT had a 
precision of 75% and a recall of 100%. The average anomaly 
scores of iLED and iECT were 2% and 1% lower than that 
of the batch method, respectively. 

3) NICTA datasets: The dataset is from a wireless mesh 
network which has seven nodes deployed by NICTA at the 
School of IT, University of Sydney lfT6l . It used a traffic 
generator to simulate traffic on the network. Packets were 
aggregated into one-minute time bins and the data was 
collected in 24 hours. There were 391 origin-destination 
flows and 1,270 time bins. Some anomalies were introduced 
to the network including DOS attacks and ping floods. The 
dataset was divided into a training set and a test set with 
100 data points. 

iLED and iECT were applied on this dataset and min- 
max scaling was used as data normalization. iLED had a 
precision of 100% and a recall of 27.3% while iECT had 
a precision of 84.6% and a recall of 100%. The average 



anomaly scores of iLED and iECT were 20% lower and 
2% higher than that of the batch method, respectively. The 
tendency of the detection here of iLED and iECT are also 
similar to those of the synthetic datasets. 

C. Summary and Discussion 

The experimental results on both synthetic and real 
datasets show that iECT generally has better detection 
ability than iLED. iECT has very high recall and acceptable 
precision. iLED, on the other hand, has very high precision 
but low recall. Both of them are faster than the batch 
method. The results on real datasets collected from different 
sources also have similar tendency showing the reliability 
and effectiveness of the proposed methods. 

The experiments also reveal that iLED tends to under- 
estimate the CTDs for anomalies while iECT tends to 
overestimate the CTDs for non-anomalies. It leads to a high 
precision for iLED and a high recall for iECT. If we can 
come up with a strategy to combine the strengths of the two 
methods, we can have a more accurate estimation. iECT is 
faster than iLED but it can only be used in case where a 
new test point is added and cannot be used when there are 
weigh updates in the graph. iLED on the other hand can be 
used in both cases by just changing the perturbation matrix. 

VII. Related work 

Incremental learning using an update on eigen decomposi- 
tion has been studied for a long time. Early work studied the 
rank one modification of the symmetric eigen decomposition 
ifTTl . ifTSl . lfT9l . The authors reduced the original problem to 
the eigen decomposition of a diagonal matrix. Though they 
can have a good approximation of the new eigenpair, they 
are not suitable for online applications nowsaday since they 
have at least O(n^) computation for the update. 

More recent approach was based on the matrix perturba- 
tion theory ll20l . ||2T1 . They used the first order perturbation 
analysis of the rank-one update for a data covariance matrix 
to compute the new eigenpair. The algorithms have a linear 
time computation. The advantage of using the covariance 
matrix is if the perturbation involving an insertion of a new 
point, the size of the covariance matrix is unchanged. This 
approach cannot be applied directly to increasing matrix size 
due to an insertion of a new point. For example, in spectral 
clustering or CTD-based anomaly detection, the size of the 



graph Laplacian matrix increases when a new point is added 
to the graph. 

Ning et. al [lU proposed an incremental approach for 
spectral clustering with appUcation to monitor evolving blog 
communities. It incrementally updates the eigenvalues and 
eigenvectors of the graph Laplacian matrix based on a 
change of an edge weight on the graph using the first order 
error of the generalized eigen system. 

VIII. Conclusion 

The paper shows two novel approaches to compute CTD 
incrementally. The first one incrementally updates the eigen- 
vectors and eigenvalues of the graph Laplacian matrix. It is 
linearly scaled and can be applied to the estimation of CTD 
incrementally or any application involving graph spectral 
computation. The second approach incrementally estimates 
CTD in constant time using the property of random walk 
and hitting time. We design novel anomaly detection algo- 
rithms using two approaches to detect anomalies online. The 
experimental results show the effectiveness of the proposed 
approaches in terms of performance and accuracy. It took 
less than 7 milliseconds on average to process a new arriving 
point in a graph of more than 300,000 nodes and one million 
edges. Moreover, the idea of this work can be extended in 
many applications which use the CTD and it is the direction 
for our future work. 
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Appendix 

We provide relation between perturbation and incidence 
matrix, and the proof details of Propositions [T] and [3] 

A. Incidence matrix and perturbation 

It is well known that the Laplacian of a graph can be 
expressed in terms of an incidence matrix. 

Definition 9: Given a weighted graph G — {V, E, W) and 
an arbitrary but fixed orientation of the edges, the incidence 
matrix R is a \V\ x |_E| matrix where the columns of the 
matrix are defined as 

/w at location w if w is the head of e 
rp{w) = •^ — a/w at location w if w is the tail of e 







otherwise 



(13) 

Note that r^ is a column vector of size \V\. We can also 
express each r^ ~ -/wu,, where u^. is a column vector with 
entries 1, -1 at locations corresponding to head and tail of 
e and at other locations. 

Fact 5: If L is a graph Laplacian then L = RR^ Il22l . 

Fact 6: If an edge e undergoes a similarity change Awe, 
the new graph Laplacian L„ is L„ = RnR^n where Rn=[R 
re(Awe)] II U . Therefore, a change in an edge weight can 
be represented by appending an incidence vector to R. 

Proposition 4: Denote R as an incidence matrix of a 
given graph G = {V, E, W). Suppose a new node is added 
to the graph which results in k new edges. Let i?„ represent 
the new |y + 1| x |i? + A:| matrix and let AR represent the 
matrix of new incidence vectors of size |y + 1| x k. Then 
the new Laplacian L„ can be expressed in terms of the old 
Lapalcian L as 



Lr. 



L 




AL 



Also note that if (A,w) is an eigenpair of L then 



A, 



is an eigenpair of 



Proof: Let £'„ be the set of new edges resulting from 
the addition of a new node. Then 



J^n^n 



R 



L 




AR 



AR 



E 

ee£„ 



WeUeul 



L 




AL = Ln 



B. Incremental Update of Eigenvalues and Eigenvectors 

Proof for Proposition [TJ 

Proof: Given the eigen decomposition of the new 
Laplacian matrix: 

{L + AL){v + Av)^{X + A\){v + Av) (14) 

The perturbation is: 



Ai=E 



(15) 



e6B„ 



Since Lv — Xv, 

LAv + ALv + ALAv = AXv + XAv + AAAw (16) 
Left multiply both sides by v^: 
v^LAv+v'^ALv+v'^ALAv = v^ AXv+v^ XAv+v^ AXAv 
Since v^L — Xv^ (L is symmetric): 

v'^AX{v + Av) = v^AL{v + Av) 
Then we have the update of the eigenvalue A: 
v'^AL{v + Av) v^AL(y + Av) 



AA = 



v'^[v + Av) l + ^TAw 

"^ YjedE^ WeUeUliv + Av) 

1 + v'^Av 
J2eeE„ We[v{i) ~ v{j)][v{i) - v{j) + Av{i) - Av{j)] 



1 + v'^Av 



(17) 



From equation [T6] we have: 

[L + AL-{X + AX)I]Av = (AA/ - AL)v 

Denotes ii' = L+Ai-(A+AA)/and /i = {AXI-AL)v, 
we have Av — K^^h. ■ 

C. Rank k Perturbation 

Lemma 1: Denote ^ G G is one of k neighbors of i and 
i is a node in G. We have: 

y ^ Piihii == -3- + 1- 



lGN{i) 



di 



Proof: Using the reversibility property of the random 
walk, it is easy to prove that the expected number of steps 
that a random walk which has just visited node i will take 
before returning back to i is graph-volume/ di ID. 



In case of i, this distance equals to the distance from i to 
one of its neighbors / (one step) plus the hitting time hu. 
Since the random walk goes from i to / with the probability 
Pii, we have 1 + J2ieN{i)Pii^ii = "^^^d~- Therefore, 

Proof for Proposition |3j 

Proof: (Sketch) By definition, 

ieN{i) leN(i) qeN{i) 

Using the same approach as the rank one case, 

h^J - 1 + E K^l + (1 - Pl^) E <Kf + Pl^^^j\ 

leN{i) qeN{l),q^i 

= 1+ E P^li'^ + (^ - Pl^)ihf/ - 1) + Puh^j] 

l€N{i) 

After a few manipulations, we have 

_^+Y,l£N{i)Pi^^°j ~J2leN{i)PilPl-iMj +'l2l£N{i)Pi^P^i 

J- — 2^ieN(i)PiiPii 
Because Y.ieN(i) PuPn ^ 1' 

h,,^l+ E P^lh'i]''■ (18) 

Since the commute distance between two nodes is the 
average of all possible path-length between them, hji « 
i '^ieN(i)i^ji + ^li)- Instead of using the normal average, 
we take into account the probability pu: 

h]i~ E P'Ui^jl -^ ^li) ^ E P^'-^ji + E Pil^'i 
l£N(i) l£N{i) leN(i) 

(19) 
We have hji w h°l'^. Moreover, from Lemma [T] we have 

Y.i(^N{i)Pahu = ^ + 1. Then from[T9l 

h,, « E P-i^t + "X + ^ ^2^) 

As a result of equations [18] and |20] 

c.,«i+ E p^< + ^ + i« E P'< + ^ 

When fc = 1 (rank one case), the formula becomes 
Equation [8] ■ 



